Shortest-Path Graph Kernels for Document Similarity
Abstract
In this paper, we present a novel document similarity measure based on the definition of a graph kernel between pairs of documents. The proposed measure takes into account both the terms contained in the documents and the relationships between them. By representing each document as a graph-of-words, we are able to model these relationships and then determine how similar two documents are by using a modified shortest-path graph kernel. We evaluate our approach on two tasks and compare it against several baseline approaches using performance metrics such as DET curves and macro-average F1-score. Experimental results on a range of datasets show that the proposed approach outperforms traditional techniques and measures the similarity between two documents more accurately.
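The abstract does not spell out the construction, so the following is only a minimal sketch of the general idea, not the authors' modified kernel. It assumes a sliding co-occurrence window for the graph-of-words, unweighted edges, and a similarity that sums 1/(d1 * d2) over term pairs connected in both documents, normalized to [0, 1]. The function names (`graph_of_words`, `shortest_path_lengths`, `sp_kernel`) and the `window` parameter are hypothetical and chosen for illustration.

```python
from collections import deque


def graph_of_words(text, window=2):
    """Build an undirected graph-of-words: each distinct term is a node, and
    two terms are linked when they co-occur within `window` consecutive tokens.
    (Whitespace tokenization and the window size are assumptions.)"""
    tokens = text.lower().split()
    adj = {t: set() for t in tokens}
    for i, t in enumerate(tokens):
        for u in tokens[i + 1:i + window]:
            if u != t:
                adj[t].add(u)
                adj[u].add(t)
    return adj


def shortest_path_lengths(adj):
    """All-pairs shortest path lengths via BFS on an unweighted graph.
    Returns {(u, v): distance} with u < v, for connected pairs only."""
    dist = {}
    for src in adj:
        seen = {src: 0}
        queue = deque([src])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in seen:
                    seen[w] = seen[v] + 1
                    queue.append(w)
        for dst, d in seen.items():
            if src < dst:
                dist[(src, dst)] = d
    return dist


def sp_kernel(text_a, text_b, window=2):
    """Illustrative shortest-path similarity between two documents.
    Each term pair contributes 1 / (d_a * d_b) when it is connected in both
    graphs; the result is cosine-normalized. This weighting is an assumption,
    not the modification described in the paper."""
    da = shortest_path_lengths(graph_of_words(text_a, window))
    db = shortest_path_lengths(graph_of_words(text_b, window))

    def dot(x, y):
        return sum(1.0 / (x[p] * y[p]) for p in x.keys() & y.keys())

    norm = (dot(da, da) * dot(db, db)) ** 0.5
    return dot(da, db) / norm if norm else 0.0
```

Under these assumptions, two identical documents score 1, documents with no shared term pairs score 0, and partially overlapping documents fall in between, with closely linked shared terms weighing more than distantly linked ones.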
Similar resources
CLAIRLIB Documentation v1.03
The Clair library is intended to simplify a number of generic tasks in Natural Language Processing (NLP), Information Retrieval (IR), and Network Analysis. Its architecture also allows for external software to be plugged in with very little effort. Functionality native to Clairlib includes Tokenization, Summarization, LexRank, Biased LexRank, Document Clustering, Document Indexing, PageRank, Bi...
A Graph Based Authorship Identification Approach: Notebook for PAN at CLEF 2015
The paper describes our approach for the Authorship Identification task at the PAN CLEF 2015. We extract textual patterns based on features obtained from shortest path walks over Integrated Syntactic Graphs (ISG). Then we calculate a similarity between the unknown document and the known document with these patterns. The approach uses a predefined threshold in order to decide if the unknown docu...
A Shortest Path Similarity Matrix based Spectral Clustering
This paper proposed a new spectral graph clustering model by casting the non-categorical spatial data sets into an undirected graph. Decomposition of the graph to Delaunay graph has been done for computational efficiency. All pair shortest path based model has been adapted for the creation of the underlying Laplacian matrix of the graph. The similarity among the nodes of the graph is measured b...
Generalized Shortest Path Kernel on Graphs
We consider the problem of classifying graphs using graph kernels. We define a new graph kernel, called the generalized shortest path kernel, based on the number and length of shortest paths between nodes. For our example classification problem, we consider the task of classifying random graphs from two well-known families, by the number of clusters they contain. We verify empirically that the ...
A Shortest Path Dependency Kernel for Relation Extraction
We present a novel approach to relation extraction, based on the observation that the information required to assert a relationship between two named entities in the same sentence is typically captured by the shortest path between the two entities in the dependency graph. Experiments on extracting top-level relations from the ACE (Automated Content Extraction) newspaper corpus show that the new...